Abstract. Symbolic Data Analysis is based on special descriptions of data - symbolic 
objects (SO). Such descriptions preserve more detailed information about units and their 
clusters than the usual representations with mean values. A special kind of symbolic 
object is a representation with frequency or probability distributions (modal values). 
This representation enables us to consider in the clustering process the variables of all 
measurement types at the same time. In the paper a clustering criterion function for SOs is 
proposed such that the representative of each cluster is again composed of distributions of 
variables’ values over the cluster. The corresponding leaders clustering method is based 
on this result. It is also shown that for the corresponding agglomerative hierarchical 
method a generalized Ward’s formula holds. Both methods are compatible: they
solve the same clustering optimization problem.

The leaders method efficiently solves clustering problems with a large number of units,
while the agglomerative method can be applied alone to a smaller data set, or it can
be applied to the leaders obtained with the compatible nonhierarchical clustering method.
Such a combination of two compatible methods enables us to decide upon the right number of
clusters on the basis of the corresponding dendrogram.

The proposed methods were applied to different data sets. In the paper, some results
of clustering of ESS data are presented.

Keywords: symbolic objects, leaders method, hierarchical clustering, Ward’s method,
European Social Survey data set


1. Introduction 


In traditional data analysis a unit is usually described with a list of (numerical, ordinal 
or nominal) values of selected variables. In symbolic data analysis (SDA) a unit of a 
data set can be represented, for each variable, with a more detailed description than only 
a single value. Such structured descriptions are usually called symbolic objects (SOs)
(Bock, H-H., Diday, E. (2000); Billard, L., Diday, E. (2006)). A special type of symbolic
objects are descriptions with frequency or probability distributions. In this way we can
consider single-valued variables and variables with richer descriptions at the same time.
The computerization of data gathering worldwide has made data sets huge. In order
to extract (explore) as much information as possible from such data, predefined
aggregation (preclustering) of the raw data is becoming common.

For example, if a large store chain (that records each purchase its customers make)
wants information about patterns of customer purchases, a likely approach would be to
aggregate the purchases of each customer inside a selected time window. A variable for a
customer can be a yearly shopping pattern on a selected item. Such a variable could be
described with a single number (average yearly purchase) or with a symbolic description -
purchases of that item aggregated by month. The second description is richer and allows
for better analyses.

In order to retain and use more information about each unit during the clustering process, 
we adapted two classical clustering methods: 


• leaders method, a generalization of the k-means method (Hartigan, J.A. (1975);
Anderberg, M.R. (1973); Diday, E. (1979));

• Ward’s hierarchical clustering method (Ward, J.H. (1963)).


Both methods are compatible - they are based on the same criterion function. Therefore
they solve the same clustering optimization problem. They can be used in combination:
using the leaders method, the size of the set of units is reduced to a manageable
number of leaders that can be further clustered using the compatible hierarchical clustering
method. This enables us to reveal the relations among the clusters/leaders and also to decide,
using the dendrogram, upon the right number of final clusters.

Since clustering objects into similar groups plays an important role in exploratory
data analysis, many clustering approaches have been developed in SDA. Symbolic objects
can be compared using many different dissimilarities with different properties. Based on
them, many clustering approaches were developed. Reviews of them can be found in basic
books and papers from the field: Bock, H-H., Diday, E. (2000), Billard, L., Diday, E.
(2003), Billard, L., Diday, E. (2006), Diday, E., Noirhomme-Fraiture, M. (2008), and
Noirhomme-Fraiture (2011). Although most attention was given to clustering of interval
data (de Carvalho, [...] are close to our approach (Gowda, K.C., Diday, E. (1991);
Ichino, M., Yaguchi, H. (1994); Korenjak-Cerne, S., Batagelj, V. (1998); Verde, R. et al.
(2000); Korenjak-Cerne, S., Batagelj, V. (2002); Irpino, A., Verde, R. (2006); Verde, R.,
Irpino, A. (2010)).

The very recent paper of Irpino, A. et al. (2014) describes the dynamic clustering
approach to histogram data based on the Wasserstein distance. This distance also allows for
automatic computation of relevance weights for variables. The approach is very appealing
but cannot be used when clustering general (not necessarily numerical) modal valued data.
In the paper de Carvalho, F.A.T., de Sousa, R.M.C.R. (2006) the authors present an
approach with dynamic clustering that could be used (with a pre-processing step) to cluster
any type of symbolic data. An adaptive squared Euclidean distance is used as the
dissimilarity. One drawback of this approach is that with dynamic clustering one
has to determine the number of clusters in advance. In the paper Kim, J., Billard, L. (2011)
the authors propose the Ichino-Yaguchi dissimilarity measure extended to histogram data, and
in the paper Kim, J., Billard, L. (2012) two measures (Ichino-Yaguchi and Gowda-Diday)
extended to general modal valued data. They use these measures with a divisive clustering
algorithm and propose two cluster validity indexes that help one decide on the optimal
number of final clusters. In the paper Kim, J., Billard, L. (2013) even more general
dissimilarity measures are proposed for use with mixed histogram, multi-valued and interval
data.
The aim of this paper is to provide a theoretical basis for a generalization of the
compatible leaders and agglomerative hierarchical clustering methods for modal valued data
with meaningful interpretations of clusters’ leaders. The novelty of our paper lies in the
proposed additional dissimilarity measures (stemming directly from the squared Euclidean
distance) that allow the use of weights for each SO (or even for each component of its
variables) in order to consider the size of each SO. It is shown that each of these
dissimilarities can be used in the leaders and in the agglomerative hierarchical clustering
method, thus allowing the user to chain both methods. In dealing with big data sets we can
use the leaders method to shrink the big data set into a more manageable number of clusters
(each represented by its leader), which can be further clustered with the hierarchical
method. Thus the number of final clusters is easily determined from the dendrogram.

When clustering units described with frequency distributions, the following problems 
can occur: 


Problem 1: The values in descriptions of different variables can be based
on a different number of original units.


A possible approach to dealing with this problem is presented in an application in
Korenjak-Cerne, S. et al., where two related data sets (teachers and their students)
are combined in an ego-centered network, which is presented with a symbolic data
description.


Problem 2: The representative of a cluster is not a meaningful representative
of the cluster.

For example, this problem appears when the clustering units are age-sex structures of the
world’s countries (e.g. Irpino, A., Verde, R. (2006); Kosmelj, K., Billard, L. (2011)). In
Korenjak-Cerne, S. et al. (201a) the authors used a weighted agglomerative clustering
approach, where clusters’ representatives are real age-sex structures, for clustering
population pyramids of the world’s countries.

Problem 3: The squared Euclidean distance favors distribution components
with the largest values.

In clustering of citation patterns Kejzar, N. et al. (2011) showed that the selection of the
squared Euclidean distance doesn’t give very informative clustering results about citation
patterns. The authors therefore suggested using relative error measures.

In this paper we show that all three problems can be solved using the generalized leaders
and Ward’s methods with an appropriate selection of dissimilarities and of weights.
They produce more meaningful clusters’ representatives. The paper also provides a
theoretical basis for the compatible usage of both methods and extends the
methods with alternative dissimilarities (proposed in Kejzar, N. et al. (2011) for classical
data representation) to modal valued SOs with general weights (not only cluster sizes).

In the following section we introduce the notation and present the development of the
adapted methods. The third section describes an example analysis of the European
Social Survey data set ESS (2010). Section four concludes the paper. In the Appendix
we provide proofs that the alternative dissimilarities can also be used with the proposed
approach.


2. Clustering 

The set of units $\mathbf{U}$ consists of symbolic objects (SOs). An SO $X$ is described
with a list of descriptions of variables $V_i$, $i = 1, \ldots, m$. In our general model,
each variable is described with a list of values $\mathbf{f}_{x_i}$:

$$X = [\mathbf{f}_{x_1}, \mathbf{f}_{x_2}, \ldots, \mathbf{f}_{x_m}],$$

where $m$ denotes the number of variables and

$$\mathbf{f}_{x_i} = [f_{x_i 1}, f_{x_i 2}, \ldots, f_{x_i k_i}],$$

with $k_i$ being the number of terms (frequencies) $f_{x_i j}$ of a variable $V_i$,
$i = 1, \ldots, m$.

Let $n_{x_i}$ be the count of values of a variable $V_i$

$$n_{x_i} = \sum_{j=1}^{k_i} f_{x_i j},$$

then we get the corresponding probability distribution

$$\mathbf{p}_{x_i} = \frac{1}{n_{x_i}}\, \mathbf{f}_{x_i}.$$

In general, a frequency distribution can be represented as a vector or graphically as 
a barplot (histogram). To preserve the same description for the variables with different 
measurement scales, the range of the continuous variables or variables with large range has 
to be categorized (partitioned into classes). In our model the values NA (not available) are 
treated as an additional category for each variable, but in some cases use of imputation 
methods for NAs would be a more recommended option. 

Clustering data with the leaders method or the hierarchical clustering method are two
approaches for solving the clustering optimization problem. We are using a criterion
function of the following form

$$P(\mathbf{C}) = \sum_{C \in \mathbf{C}} p(C).$$

The total error $P(\mathbf{C})$ of the clustering $\mathbf{C}$ is the sum of cluster errors
$p(C)$ of its clusters $C \in \mathbf{C}$.

There are many possibilities how to express the cluster error $p(C)$. In this paper we
shall assume a model in which the error of a cluster is the sum of dissimilarities of its
units from the cluster’s representative $T$. For a given representative $T$ and a cluster
$C$ we define the cluster error with respect to $T$:

$$p(C, T) = \sum_{X \in C} d(X, T),$$

where $d$ is a selected dissimilarity measure. The best representative $T_C$ is called a leader

$$T_C = \arg\min_T p(C, T).$$

Then we define

$$p(C) = p(C, T_C) = \min_T \sum_{X \in C} d(X, T). \tag{1}$$

We assume that the leader $T$ has the same structure of description as the SOs (i.e. it
is represented with a list of nonnegative vectors $\mathbf{t}_i$ of size $k_i$ for each
variable $V_i$). We do not require that they are distributions, therefore the representation
space is $\mathcal{T} = (\mathbb{R}_0^+)^{k_1} \times (\mathbb{R}_0^+)^{k_2} \times \cdots
\times (\mathbb{R}_0^+)^{k_m}$.

We introduce a dissimilarity measure between SOs and $T$ with

$$d(X, T) = \sum_i \alpha_i\, d_i(X, T), \qquad \alpha_i > 0, \qquad \sum_i \alpha_i = 1,$$

where $\alpha_i$ are weights for variables (i.e. they allow us to make some variables more
or less important) and

$$d_i(X, T) = \sum_{j=1}^{k_i} w_{x_i j}\, \delta(p_{x_i j}, t_{ij}), \qquad w_{x_i j} \geq 0,$$

where $w_{x_i j}$ are weights for each variable’s component. This is a kind of generalization
of the squared Euclidean distance. Using an alternative basic dissimilarity $\delta$ we can
address problem 3. Some examples of basic dissimilarities $\delta$ are presented in Table 1.
It lists the basic dissimilarities between the unit’s component and the leader’s component
that were proposed in Kejzar, N. et al. (2011) for classical data representation. In this
paper we extend them to modal valued SOs.
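
The definition above can be illustrated with a minimal Python sketch. It assumes the basic dissimilarity $\delta_1(p, t) = (p - t)^2$ and represents an SO as a list of probability vectors, one per variable. The function names `delta1` and `dissimilarity` and the sample numbers are hypothetical illustrations, not part of the paper or of the clamix package.

```python
def delta1(p, t):
    """Basic dissimilarity delta_1: squared difference of two components."""
    return (p - t) ** 2


def dissimilarity(p_x, w_x, leader, alpha, delta=delta1):
    """d(X, T) = sum_i alpha_i * sum_j w_xij * delta(p_xij, t_ij).

    p_x, w_x and leader hold one vector per variable;
    alpha holds one weight per variable (assumed to sum to 1).
    """
    total = 0.0
    for a_i, p_i, w_i, t_i in zip(alpha, p_x, w_x, leader):
        d_i = sum(w * delta(p, t) for p, w, t in zip(p_i, w_i, t_i))
        total += a_i * d_i
    return total


# Hypothetical unit with two variables (2 and 3 categories).
p_x = [[0.5, 0.5], [0.2, 0.3, 0.5]]
w_x = [[4, 4], [4, 4, 4]]            # same weight for all components of a variable
leader = [[0.4, 0.6], [0.2, 0.2, 0.6]]
alpha = [0.5, 0.5]

d = dissimilarity(p_x, w_x, leader, alpha)
```

Passing another function as `delta` would switch to one of the alternative basic dissimilarities without changing the rest of the computation.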

The weight $w_{x_i j}$ can, for the same unit $X$, be different for each variable $V_i$ and
also for each of its components. With weights we can include in the clustering process a
different number of original units for each variable (solving problem 1 and/or problem 2),
and they also allow us to regulate the importance of each variable’s category. For example,
the population pyramid of a country $X$ can be represented with two symbolic variables (one
for each gender), where people of each gender are represented with the distribution over age
groups. Here, $w_{x_1}$ is the number of all men and $w_{x_2}$ is the number of all women in
the country $X$.

To include and preserve the information about the variable distributions and their size
throughout the clustering process (problem 2), the following has to hold when merging two
disjoint clusters $C_u$ and $C_v$ (a cluster may consist of one unit only):

$$\mathbf{p}_{(uv)i} = \frac{1}{w_{(uv)i}}\, \mathbf{f}_{(uv)i},$$

where $\mathbf{p}_{(uv)i}$ denotes the relative distribution of the variable $V_i$ of the
joint cluster $C_{(uv)} = C_u \cup C_v$, $\mathbf{f}_{(uv)i}$ the frequency distribution of
variable $V_i$ in the joint cluster and $w_{(uv)i}$ the weight (count of values) for that
variable in the joint cluster.

Although we are using the notation $\mathbf{f}$, which is usually used for frequencies, other
interpretations of $\mathbf{f}$ and $w$ are possible. For example, $w$ is the money spent and
$\mathbf{f}$ is the distribution of the money spent on a selected item in a given time period.
2.1. Leaders method. The leaders method, also called the dynamic clouds method (Diday, E.,
1979), is a generalization of the popular nonhierarchical k-means clustering method
(Anderberg, M.R., 1973; Hartigan, J.A., 1975). The idea is to get the “optimal” clustering
into a pre-specified number of clusters with an iterative procedure. For a current clustering
the leaders are determined as the best representatives of its clusters, and the new
clustering is determined by assigning each unit to the nearest leader. The process stops
when the result stabilizes. In the generalized approach, two steps should be elaborated:


• how to determine the new leaders; 

• how to determine the new clusters according to the new leaders. 


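2.1.1. Determining the new leaders. Given a cluster $C$, the corresponding leader
$T_C \in \mathcal{T}$ is the solution of the problem (Eq. (1))

$$
\begin{aligned}
T_C &= \arg\min_T \sum_{X \in C} d(X, T) = \arg\min_T \sum_{X \in C} \sum_i \alpha_i\, d_i(X, T) \\
    &= \arg\min_T \sum_i \alpha_i \sum_{X \in C} d_i(X, T)
     = \Big[ \arg\min_{\mathbf{t}_i} \sum_{X \in C} d_i(X, T) \Big]_{i=1}^{m}.
\end{aligned}
$$

Denoting $T_C = [\mathbf{t}_1^*, \mathbf{t}_2^*, \ldots, \mathbf{t}_m^*]$, where
$\mathbf{t}_i^* \in (\mathbb{R}_0^+)^{k_i}$, $i = 1, 2, \ldots, m$, we get the following
requirement: $\mathbf{t}_i^* = \arg\min_{\mathbf{t}_i} \sum_{X \in C} d_i(\mathbf{x}_i, \mathbf{t}_i)$.

Because of the additivity of the model we can observe each variable separately and
simplify the notation by omitting the index $i$.

$$
\begin{aligned}
\mathbf{t}^* &= \arg\min_{\mathbf{t}} \sum_{X \in C} d(\mathbf{x}, \mathbf{t})
              = \arg\min_{\mathbf{t}} \sum_{X \in C} \sum_{j=1}^{k} w_{xj}\, \delta(p_{xj}, t_j) \\
             &= \arg\min_{\mathbf{t}} \sum_{j=1}^{k} \sum_{X \in C} w_{xj}\, \delta(p_{xj}, t_j)
              = \Big[ \arg\min_{t_j \in \mathbb{R}} \sum_{X \in C} w_{xj}\, \delta(p_{xj}, t_j) \Big]_{j=1}^{k}.
\end{aligned}
$$

Since in our model the components are also independent, we can optimize component-wise
and omit the index $j$:

$$t^* = \arg\min_{t \in \mathbb{R}} \sum_{X \in C} w_x\, \delta(p_x, t). \tag{2}$$

$t^*$ is a kind of Fréchet mean (median) for a selected basic dissimilarity $\delta$.

This is a standard optimization problem with one real variable. The solution has to
satisfy the condition

$$\frac{\partial}{\partial t} \sum_{X \in C} w_x\, \delta(p_x, t) = 0. \tag{3}$$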
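Table 1. The basic dissimilarities and the corresponding cluster leader,
the leader of the merged clusters and dissimilarity between merged clusters.
Indices i and j are omitted.

$$
\begin{array}{llll}
\delta(x, t) & t^*_C & z & D(u, v) \\
\hline
\delta_1 = (p_x - t)^2
  & \dfrac{P_C}{w_C}
  & \dfrac{w_u u + w_v v}{w_u + w_v}
  & \dfrac{w_u w_v}{w_u + w_v}\,(u - v)^2 \\[2ex]
\delta_2 = \left(\dfrac{p_x - t}{t}\right)^2
  & \dfrac{Q_C}{P_C}
  & \dfrac{u P_u + v P_v}{P_u + P_v}
  & \dfrac{P_u}{u}\left(\dfrac{u - z}{z}\right)^2 + \dfrac{P_v}{v}\left(\dfrac{v - z}{z}\right)^2 \\[2ex]
\delta_3 = \dfrac{(p_x - t)^2}{t}
  & \sqrt{\dfrac{Q_C}{w_C}}
  & \sqrt{\dfrac{u^2 w_u + v^2 w_v}{w_u + w_v}}
  & w_u \dfrac{(u - z)^2}{z} + w_v \dfrac{(v - z)^2}{z} \\[2ex]
\delta_4 = \left(\dfrac{p_x - t}{p_x}\right)^2
  & \dfrac{H_C}{G_C}
  & \dfrac{H_u + H_v}{\frac{H_u}{u} + \frac{H_v}{v}}
  & G_u (u - z)^2 + G_v (v - z)^2 \\[2ex]
\delta_5 = \dfrac{(p_x - t)^2}{p_x}
  & \dfrac{w_C}{H_C}
  & \dfrac{w_u + w_v}{\frac{w_u}{u} + \frac{w_v}{v}}
  & w_u \dfrac{(u - z)^2}{u} + w_v \dfrac{(v - z)^2}{v} \\[2ex]
\delta_6 = \dfrac{(p_x - t)^2}{p_x t}
  & \sqrt{\dfrac{P_C}{H_C}}
  & \sqrt{\dfrac{P_u + P_v}{\frac{P_u}{u^2} + \frac{P_v}{v^2}}}
  & \dfrac{P_u}{u}\,\dfrac{(u - z)^2}{u z} + \dfrac{P_v}{v}\,\dfrac{(v - z)^2}{v z}
\end{array}
$$

$$
w_C = \sum_{X \in C} w_x, \quad
P_C = \sum_{X \in C} w_x p_x, \quad
Q_C = \sum_{X \in C} w_x p_x^2, \quad
H_C = \sum_{X \in C} \frac{w_x}{p_x}, \quad
G_C = \sum_{X \in C} \frac{w_x}{p_x^2}
$$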

Leaders for $\delta_1$. Here we present the derivation only for the basic dissimilarity
$\delta_1$. The derivations for other dissimilarities $\delta$ from Table 1 are given in the
appendix.

The traditional clustering criterion function in the k-means and Ward’s clustering methods
is based on the squared Euclidean distance dissimilarity $d$ that is based on the basic
dissimilarity $\delta_1(p_x, t) = (p_x - t)^2$. For it we get from (3)

$$0 = \sum_{X \in C} w_x \frac{\partial}{\partial t} (p_x - t)^2 = -2 \sum_{X \in C} w_x (p_x - t).$$
Therefore

$$t^* = \frac{\sum_{X \in C} w_x p_x}{\sum_{X \in C} w_x} = \frac{P_C}{w_C}. \tag{4}$$

The usage of selected weights in the dissimilarity provides meaningful cluster
representations, resulting from the following two properties:

Property 1. Let $w_{x_i j} = w_{x_i}$; then for each $i = 1, \ldots, m$:

$$
\sum_{j=1}^{k_i} t^*_{ij}
 = \frac{1}{w_{C_i}} \sum_{j=1}^{k_i} \sum_{X \in C} w_{x_i} p_{x_i j}
 = \frac{1}{w_{C_i}} \sum_{X \in C} w_{x_i} \sum_{j=1}^{k_i} p_{x_i j}
 = \frac{1}{w_{C_i}} \sum_{X \in C} w_{x_i} = 1.
$$

If the weight $w_{x_i j}$ is the same for all components of variable $V_i$,
$w_{x_i j} = w_{x_i}$, then for $\delta_1$ the leaders’ vectors $\mathbf{t}_i^*$ are
distributions.

Property 2. Let further $w_{x_i j} = n_{x_i}$; then for each cluster $C$, $i = 1, \ldots, m$
and $j = 1, \ldots, k_i$:

$$
t^*_{Cij}
 = \frac{\sum_{X \in C} n_{x_i} p_{x_i j}}{\sum_{X \in C} n_{x_i}}
 = \frac{\sum_{X \in C} f_{x_i j}}{\sum_{X \in C} n_{x_i}}
 = \frac{f_{Cij}}{n_{C_i}} = p_{Cij}.
$$

Note that in this case the weight $w_{x_i j}$ is constant for all components of the same
variable. This result provides a solution to problem 2.
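
Property 2 can be checked numerically with a small Python sketch. It assumes $\delta_1$ and weights equal to the counts $n_x$; the helper name `leader_delta1` and the three sample households are hypothetical.

```python
def leader_delta1(freqs):
    """Leader of one variable under delta_1 with weights w_x = n_x (Eq. 4).

    freqs: list of frequency vectors f_x, one per unit of the cluster.
    """
    k = len(freqs[0])
    n = [sum(f) for f in freqs]                              # n_x for each unit
    p = [[fj / nx for fj in f] for f, nx in zip(freqs, n)]   # p_x = f_x / n_x
    n_c = sum(n)
    return [sum(nx * px[j] for nx, px in zip(n, p)) / n_c for j in range(k)]


# Hypothetical cluster of three units with counts over three categories.
freqs = [[1, 2, 1], [0, 2, 0], [3, 1, 2]]
t_star = leader_delta1(freqs)

# Pooled distribution of the whole cluster: f_C / n_C.
f_c = [sum(f[j] for f in freqs) for j in range(3)]
pooled = [fj / sum(f_c) for fj in f_c]
```

In this sketch the leader coincides with the pooled distribution of the cluster, which is what makes it a meaningful representative.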

For each basic dissimilarity $\delta$ the corresponding optimal leader, the leader of the
merged clusters, and the dissimilarity $D$ between clusters are given in Table 1.
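
As a hedged illustration of the leader formulas listed in Table 1, the following Python sketch computes the six leaders for one component of one variable and lets one compare the cluster error at the leader with the error at nearby values. The names `DELTAS`, `table1_leaders` and `cluster_error` and the sample data are assumptions made for this example; all $p_x$ are taken to be strictly positive so that $\delta_4$ to $\delta_6$ are defined.

```python
import math

# The six basic dissimilarities delta(p, t) from Table 1.
DELTAS = {
    "delta1": lambda p, t: (p - t) ** 2,
    "delta2": lambda p, t: ((p - t) / t) ** 2,
    "delta3": lambda p, t: (p - t) ** 2 / t,
    "delta4": lambda p, t: ((p - t) / p) ** 2,
    "delta5": lambda p, t: (p - t) ** 2 / p,
    "delta6": lambda p, t: (p - t) ** 2 / (p * t),
}


def table1_leaders(p, w):
    """Closed-form leaders t* for one component, using w_C, P_C, Q_C, H_C, G_C."""
    w_c = sum(w)
    p_c = sum(wi * pi for wi, pi in zip(w, p))
    q_c = sum(wi * pi ** 2 for wi, pi in zip(w, p))
    h_c = sum(wi / pi for wi, pi in zip(w, p))
    g_c = sum(wi / pi ** 2 for wi, pi in zip(w, p))
    return {
        "delta1": p_c / w_c,
        "delta2": q_c / p_c,
        "delta3": math.sqrt(q_c / w_c),
        "delta4": h_c / g_c,
        "delta5": w_c / h_c,
        "delta6": math.sqrt(p_c / h_c),
    }


def cluster_error(name, p, w, t):
    """Weighted cluster error sum_x w_x * delta(p_x, t) for one component."""
    return sum(wi * DELTAS[name](pi, t) for wi, pi in zip(w, p))


# Hypothetical component values and weights of a three-unit cluster.
p = [0.2, 0.5, 0.4]
w = [3.0, 1.0, 2.0]
stars = table1_leaders(p, w)
```

For instance, `cluster_error("delta3", p, w, stars["delta3"])` should not exceed the error obtained after shifting the leader slightly in either direction.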

2.1.2. Determining new clusters. Given leaders $\mathbf{T}$, the corresponding optimal
clustering $\mathbf{C}^*$ is determined from

$$P(\mathbf{C}^*) = \sum_{X \in \mathbf{U}} \min_k d(X, T_k) = \sum_{X \in \mathbf{U}} d(X, T_{c^*(X)}), \tag{5}$$

where $c^*(X) = \arg\min_k d(X, T_k)$. We assign each unit $X$ to the closest leader
$T_k \in \mathbf{T}$.

In the case that some cluster becomes empty, usually the most distant unit from some
other cluster is assigned to it. In the current version of the R package clamix
(Batagelj, V., Kejzar, N., 2010) the most dissimilar unit from all the cluster leaders is
assigned to the empty cluster.
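
The two alternating steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the clamix implementation: it uses $\delta_1$, one weight per unit and variable ($w_{x_i j} = w_{x_i}$), randomly chosen units as initial leaders, and it simply keeps the old leader when a cluster becomes empty instead of reassigning a unit. All function names are hypothetical.

```python
import random


def d_unit(p_x, w_x, leader, alpha):
    """d(X, T) with delta_1 and one weight per variable (w_xij = w_xi)."""
    return sum(
        a * w * sum((p - t) ** 2 for p, t in zip(p_i, t_i))
        for a, w, p_i, t_i in zip(alpha, w_x, p_x, leader)
    )


def new_leader(units, weights, members):
    """Weighted mean of the members' distributions, variable by variable (Eq. 4)."""
    leader = []
    for i in range(len(units[0])):
        w_c = sum(weights[x][i] for x in members)
        k = len(units[0][i])
        leader.append([
            sum(weights[x][i] * units[x][i][j] for x in members) / w_c
            for j in range(k)
        ])
    return leader


def leaders_method(units, weights, alpha, n_clusters, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Assumption of this sketch: initial leaders are randomly chosen units.
    leaders = [units[x] for x in rng.sample(range(len(units)), n_clusters)]
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each unit to its closest leader.
        new_assignment = [
            min(range(n_clusters),
                key=lambda c: d_unit(units[x], weights[x], leaders[c], alpha))
            for x in range(len(units))
        ]
        if new_assignment == assignment:
            break  # the result has stabilized
        assignment = new_assignment
        # Step 2: recompute the leader of every non-empty cluster.
        for c in range(n_clusters):
            members = [x for x, a in enumerate(assignment) if a == c]
            if members:  # simplification: an empty cluster keeps its old leader
                leaders[c] = new_leader(units, weights, members)
    return assignment, leaders


# Hypothetical data: four units, one variable with two categories.
units = [[[0.9, 0.1]], [[0.8, 0.2]], [[0.1, 0.9]], [[0.2, 0.8]]]
weights = [[1.0], [1.0], [1.0], [1.0]]
assignment, leaders = leaders_method(units, weights, alpha=[1.0], n_clusters=2)
```

Because the result depends on the initial leaders, in practice several runs with different seeds would be compared by the value of the criterion function.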


2.2. Hierarchical method. The idea of the agglomerative hierarchical clustering procedure
is a step-by-step merging of the two closest clusters, starting from the clustering in
which each unit forms its own cluster. The computation of dissimilarities between the new
(merged) cluster and the remaining clusters has to be specified.

2.2.1. Dissimilarity between clusters. To obtain compatibility with the adapted leaders
method, we define the dissimilarity between clusters $C_u$ and $C_v$,
$C_u \cap C_v = \emptyset$, as (Batagelj, V., 1988)

$$D(C_u, C_v) = p(C_u \cup C_v) - p(C_u) - p(C_v). \tag{6}$$

Let us first do some general computation. $\mathbf{u}_i$ and $\mathbf{v}_i$ are the $i$-th
variables of the leaders $U$ and $V$ of clusters $C_u$ and $C_v$, and $\mathbf{z}_i$ is a
component of the leader $Z$ of the cluster $C_u \cup C_v$. Then
$$
\begin{aligned}
D(C_u, C_v) &= p(C_u \cup C_v) - p(C_u) - p(C_v) \\
 &= \sum_i \alpha_i \Big[ \sum_{X \in C_u \cup C_v} d_i(X, Z) - \sum_{X \in C_u} d_i(X, U) - \sum_{X \in C_v} d_i(X, V) \Big]
  = \sum_i \alpha_i D_i(C_u, C_v).
\end{aligned}
$$

Since $C_u \cap C_v = \emptyset$ we have

$$D_i(C_u, C_v) = \sum_{X \in C_u} \left[ d_i(X, Z) - d_i(X, U) \right] + \sum_{X \in C_v} \left[ d_i(X, Z) - d_i(X, V) \right] = S_{ui} + S_{vi}. \tag{7}$$

Let us expand the first term

$$S_{ui} = \sum_{X \in C_u} \sum_j w_{x_i j} \left[ \delta(p_{x_i j}, z_{ij}) - \delta(p_{x_i j}, u_{ij}) \right] = \sum_j S_{uij}. \tag{8}$$

2.2.2. Generalized Ward’s relation for $\delta_1$. Now we consider a selected basic
dissimilarity $\delta_1(p_x, t) = (p_x - t)^2$. We get (omitting the indices $i$ and $j$)

$$S_{uij} = \sum_{X \in C_u} w_x \left[ (p_x - z)^2 - (p_x - u)^2 \right] = \sum_{X \in C_u} w_x \left( z^2 - 2 p_x z + 2 p_x u - u^2 \right)$$

and, as we know that $\sum_{X \in C_u} w_x p_x = w_u u$,

$$S_{uij} = w_u z^2 - 2 w_u u (z - u) - w_u u^2 = w_u (z - u)^2.$$

Therefore

$$
\begin{aligned}
D_{ij}(C_u, C_v) &= w_u (z - u)^2 + w_v (z - v)^2 \\
 &= w_u (z^2 - 2uz + u^2) + w_v (z^2 - 2vz + v^2) \\
 &= z^2 (w_u + w_v) - 2z (w_u u + w_v v) + w_u u^2 + w_v v^2.
\end{aligned}
$$

We can express the new cluster leader’s element $z$ also in a different way.

$$w_z z = P_z = \sum_{X \in C_u \cup C_v} w_x p_x = \sum_{X \in C_u} w_x p_x + \sum_{X \in C_v} w_x p_x = w_u u + w_v v$$

Therefore

$$z = \frac{w_u u + w_v v}{w_u + w_v}.$$

This relation is used in the expression for $D_{ij}(C_u, C_v)$:

$$
\begin{aligned}
D_{ij}(C_u, C_v) &= w_u u^2 + w_v v^2 - (w_u + w_v) z^2 \\
 &= w_u u^2 + w_v v^2 - (w_u + w_v) \left( \frac{w_u u + w_v v}{w_u + w_v} \right)^2
  = \frac{w_u w_v}{w_u + w_v} (u - v)^2
\end{aligned}
$$

and finally, reintroducing $i$ and $j$, we get

$$D(C_u, C_v) = \sum_i \alpha_i \sum_j \frac{w_{uij} \cdot w_{vij}}{w_{uij} + w_{vij}} (u_{ij} - v_{ij})^2, \tag{9}$$








VLADIMIR. BATAGELJ, NATASA KEJZAR, AND SIMONA KORENJAK-CERNE UNIVERSITY OF LJUBLJANA, SLOVENIA 


a generalized Ward’s relation. Note that this relation also holds for singletons
$C_u = \{X\}$ or $C_v = \{Y\}$, $X, Y \in \mathbf{U}$.
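
The equivalence between definition (6) and relation (9) can be checked numerically for a single component. The Python sketch below assumes $\delta_1$; the helper names and the sample values are hypothetical.

```python
def leader_and_error(p, w):
    """Weight w_C, leader t* (Eq. 4) and cluster error for one component under delta_1."""
    w_c = sum(w)
    t = sum(wi * pi for wi, pi in zip(w, p)) / w_c
    err = sum(wi * (pi - t) ** 2 for wi, pi in zip(w, p))
    return w_c, t, err


def ward_delta1(w_u, u, w_v, v):
    """Single-component term of the generalized Ward's relation (Eq. 9)."""
    return w_u * w_v / (w_u + w_v) * (u - v) ** 2


# Hypothetical values of one component in two disjoint clusters.
p_u, wt_u = [0.2, 0.4, 0.3], [2.0, 1.0, 3.0]
p_v, wt_v = [0.7, 0.6], [1.5, 2.5]

w_u, u, err_u = leader_and_error(p_u, wt_u)
w_v, v, err_v = leader_and_error(p_v, wt_v)
_, z, err_uv = leader_and_error(p_u + p_v, wt_u + wt_v)

d_definition = err_uv - err_u - err_v      # Eq. (6)
d_ward = ward_delta1(w_u, u, w_v, v)       # Eq. (9), one component
```

The practical benefit of relation (9) is that the dissimilarity between two clusters needs only their leaders and weights, not the original units.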

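Special cases of the generalized Ward’s relation. When $w_{x_i} = w_x$ is the same for all
variables $V_i$, $i = 1, \ldots, m$, Ward’s relation (9) can be simplified. In this case we
have

$$\mathbf{t}_i^* = \frac{1}{w_C} \sum_{X \in C} w_x\, \mathbf{p}_{x_i},$$

where the sum of weights $w_C = \sum_{X \in C} w_{x_i} = \sum_{X \in C} w_x$ is independent
of $i$. Therefore we have

$$D(C_u, C_v) = \frac{w_u \cdot w_v}{w_u + w_v} \sum_i \alpha_i \sum_j (u_{ij} - v_{ij})^2 = \frac{w_u \cdot w_v}{w_u + w_v}\, d(\mathbf{u}, \mathbf{v}).$$

In the case when for each variable $V_i$ all $w_{x_i} = 1$, further simplifications are
possible. Since $\sum_{X \in C} 1 = |C|$ we get

$$\mathbf{t}_i^* = \frac{1}{|C|} \sum_{X \in C} \mathbf{p}_{x_i}$$

and

$$D(C_u, C_v) = \frac{|C_u| \cdot |C_v|}{|C_u| + |C_v|} \sum_i \alpha_i \sum_j (u_{ij} - v_{ij})^2 = \frac{|C_u| \cdot |C_v|}{|C_u| + |C_v|}\, d(\mathbf{u}, \mathbf{v}).$$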

2.3. Huygens Theorem for $\delta_1$. Huygens theorem has a very important role in many
fields. In statistics it can be related to the decomposition of the sum of squares, on which
the analysis of variance is based. In clustering it is commonly used for deriving clustering
criteria. It has the form

$$TI = WI + BI, \tag{10}$$

where $TI$ is the total inertia, $WI$ is the inertia within clusters and $BI$ is the inertia
between clusters.

Let $T_{\mathbf{U}}$ denote the leader of the cluster consisting of all units $\mathbf{U}$.
Then we define (Batagelj, V., 1988)

$$
\begin{aligned}
TI &= \sum_{X \in \mathbf{U}} d(X, T_{\mathbf{U}}), \\
WI &= P(\mathbf{C}) = \sum_{C \in \mathbf{C}} \sum_{X \in C} d(X, T_C), \\
BI &= \sum_{C \in \mathbf{C}} d(T_C, T_{\mathbf{U}}).
\end{aligned}
$$

For a selected dissimilarity $d$ and a given set of units $\mathbf{U}$ the value of total
inertia $TI$ is fixed. Therefore, if Huygens theorem holds, the minimization of the within
inertia $WI = P(\mathbf{C})$ is equivalent to the maximization of the between inertia $BI$.

To prove that Huygens theorem holds for $\delta_1$ we proceed as follows. Because of the
additivity of $TI$, $WI$ and $BI$ and the component-wise definition of the dissimilarity $d$,
the derivation can be limited to a single variable and a single component. The subscripts
$i$ and $j$ are omitted. We have

$$
\begin{aligned}
TI - WI &= \sum_{C \in \mathbf{C}} \sum_{X \in C} w_x \left[ (p_x - t_{\mathbf{U}})^2 - (p_x - t_C)^2 \right] \\
 &= \sum_{C \in \mathbf{C}} \sum_{X \in C} w_x \left( -2 p_x t_{\mathbf{U}} + t_{\mathbf{U}}^2 + 2 p_x t_C - t_C^2 \right) \\
 &= \sum_{C \in \mathbf{C}} \left( -2 w_C t_C t_{\mathbf{U}} + w_C t_{\mathbf{U}}^2 + 2 w_C t_C^2 - w_C t_C^2 \right) \\
 &= \sum_{C \in \mathbf{C}} w_C (t_C - t_{\mathbf{U}})^2 = \sum_{C \in \mathbf{C}} d(t_C, t_{\mathbf{U}}) = BI.
\end{aligned}
$$

This proves the theorem. In the transition from the second line to the third line we
used that for $\delta_1$ (Eq. (4)) $\sum_{X \in C} w_x p_x = w_C t_C$ holds.

3. Example 


The proposed methods were successfully applied to different data sets: population pyramids,
TIMSS, cars, foods, citation patterns of patents, and others. To demonstrate some of
the possible usages of the described methods, some results of clustering a selected subset
of the European Social Survey data set are presented.

The data set ESS (2010) is an output from an academically driven social survey. Its
main purpose is to gain insight into the behavior patterns, beliefs and attitudes of
Europe’s populations (ESS www, 2012). The survey covers over 30 nations and is conducted
biennially. The survey data for Round 5 (conducted in 2010) consist of 662 variables and
include more than 50,000 respondents. For our purposes we focused on the variables that
describe household structure: (a) the gender of each person in the household, (b) the
relationship to the respondent, (c) the year of birth of each person in the household and
(d) the country of residence of the respondent, therefore also the country of the household.
From these variables (the respondent answered the first three questions for every member of
his/her household) symbolic variables (with counts of household members) were constructed.
Variable $V_3$ was the only numeric variable and therefore a decision had to be made on how
to choose the category borders. From the economic point of view a categorization into
economic groups of the working population is the most meaningful, therefore we chose this
categorization in the first variant ($V_{3a}$, according to working population, denoted with
WP). But since demographic data about age are usually aggregated into five-year or ten-year
groups, we also consider ten-year intervals as a second option for the categorization of the
age variable ($V_{3b}$, in 10-year intervals, denoted with AG).


• $V_1$: gender (2 components):

{male : $f_{11}$, female : $f_{12}$}

• $V_2$: categories of household members (7 components, respondent constantly 1):
{respondent : $f_{21} = 1$, partner : $f_{22}$, offspring : $f_{23}$, parents : $f_{24}$,
siblings : $f_{25}$, relatives : $f_{26}$, others : $f_{27}$}

• $V_3$: year of birth for every household member:

- $V_{3a}$: according to working population (5 components):

{0-19 years : $f_{31}$, 20-34 years : $f_{32}$, 35-64 years : $f_{33}$, 65+ years :
$f_{34}$, NA : $f_{35}$}
or

- $V_{3b}$: 10-year groups (10 components):

{0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80+, NA}

• $V_4$: country of residence (26 components, all but one with value zero):

{Belgium : $f_{41}$, ..., Ukraine : $f_{4,26}$}

There were 641 respondents with missing values for year of birth, therefore it seemed
reasonable to add the category NA to variable $V_3$. This way of handling missing values
is very naive and could possibly lead to biased results (i.e. it could be conjectured that
birth years of very old or non-related family members are mostly missing, so they could
form a special pattern). A refined clustering analysis would better use one of the
well-known imputation methods (e.g. multiple imputation, Rubin, D.B. (1987)). Note that for
each unit (respondent) in the data set the components of variables $V_1$ to $V_3$ sum to a
constant number (the number of all household members of that respondent). For the last
variable $V_4$, the sum equals 1.

Design and population weights are supplied by the data set for each unit (respondent).
In order to get results that are representative of the EU population, both weights should be
used also for households. Because special weights for households are not available, we used
the weights provided in the data set in our demonstration: before clustering, each unit’s
(respondent’s) symbolic variables were multiplied by the design and population weights. The
weight $w_{x_i}$ (used in the clustering process) for variables $V_1$ to $V_3$ was then the
number of household members multiplied by both supplied weights, and for $V_4$ the product
of both weights alone.
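
How a respondent's household could be turned into a symbolic object with its weight is sketched below in Python. The sketch covers only $V_1$ and $V_{3a}$ and assumes that ages (rather than years of birth) are already available, with `None` marking a missing value. The function names, the input layout and the sample weights are hypothetical and do not come from the ESS files.

```python
WP_GROUPS = ["0-19", "20-34", "35-64", "65+", "NA"]


def wp_group(age):
    """Index of the working-population age group; None means a missing value."""
    if age is None:
        return 4
    if age <= 19:
        return 0
    if age <= 34:
        return 1
    if age <= 64:
        return 2
    return 3


def household_so(members, design_weight, population_weight):
    """Build frequency vectors for V1 (gender) and V3a (age groups).

    members: list of (gender, age) tuples, gender in {"male", "female"}.
    Returns (f_gender, f_age, weight), where weight is the number of household
    members multiplied by both survey weights.
    """
    f_gender = [0, 0]
    f_age = [0] * len(WP_GROUPS)
    for gender, age in members:
        f_gender[0 if gender == "male" else 1] += 1
        f_age[wp_group(age)] += 1
    weight = len(members) * design_weight * population_weight
    return f_gender, f_age, weight


# Hypothetical household of four, one member with unknown age.
members = [("female", 42), ("male", 45), ("female", 12), ("male", None)]
f_gender, f_age, weight = household_so(members, design_weight=1.2, population_weight=0.5)
```

Both frequency vectors sum to the household size, so dividing them by that size gives the distributions used in the clustering.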


3.1. Questions about household structures. Our motivation for clustering this data
set was the question: what are the main European household patterns? And further, does
the categorization of ages of people in households strongly influence the best clustering
results? To answer the latter, two data sets with different variable sets were
constructed: (a) a data set with ages according to working population (further denoted as
variable set WP) and (b) one with ages split into nine 10-year groups (further denoted as
variable set AG). Clustering was done on the three variables (gender, category of household
members and age groups of household members).

Since we were also interested in whether the household patterns differ according to
countries, we inspected the variable ’country’ after clustering. However, this does not
answer the following question: does the country influence the household patterns? To be able
to say something about that, ’country’ has to be included in the data set, so a third
clustering on data with all four variables was performed.


3.2. Clustering process. The set of units is relatively large: 50,372 units. Therefore
the clustering had to be done in two steps:

(1) cluster the units with the non-hierarchical leaders method to get a smaller number (20)
of clusters with their leaders;

(2) cluster the clusters from the first step (i.e. the leaders) using the hierarchical
method to get a small number of final clusters.

The methods are based on the same criterion function (minimization of the cluster errors
based on the generalized squared Euclidean distance with $\delta = \delta_1$).

Ten runs of leaders clustering were performed for each data set (two data sets with three
variables and one with four variables, including country). The best result (20 leaders) for
each data set was further clustered with hierarchical clustering. The dendrogram and
final clusters were visually evaluated. Where the variable ’country’ was not included in the
data set, it was plotted later for each of the four groups and its pattern was examined.
Since this example serves as an illustrative case and the result proved to be very stable
over ten runs, we used ten runs only; in an actual application more runs of the leaders
method would be recommended. The number of final clusters was selected by visually
inspecting the dendrogram and cutting where the dissimilarity among clusters had the highest
jump (apart from the clustering into only two groups).
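
The visual rule for choosing the number of final clusters can be mimicked programmatically. The Python sketch below is only an approximation of that rule, assuming the merge dissimilarities are given in agglomeration order; the function name and the sample heights are hypothetical.

```python
def clusters_at_largest_jump(heights):
    """Number of clusters obtained by cutting at the largest jump in merge heights.

    heights: merge dissimilarities in agglomeration order (n - 1 values for
    n clustered leaders, at least 3 values). The cut that would leave only
    two clusters is ignored.
    """
    n = len(heights) + 1
    # Jump i lies between merge i and merge i + 1; the last jump is skipped
    # because cutting there would give the two-cluster solution.
    jumps = [heights[i + 1] - heights[i] for i in range(len(heights) - 2)]
    best = max(range(len(jumps)), key=jumps.__getitem__)
    # After performing merges 0..best, n - (best + 1) clusters remain.
    return n - (best + 1)


# Hypothetical merge heights for 7 leaders.
heights = [0.1, 0.2, 0.25, 0.3, 1.5, 4.0]
n_final = clusters_at_largest_jump(heights)
```

Such a rule is only a starting point; the dendrogram should still be inspected, as was done in the analysis described here.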

3.3. Results. The results were very stable. In Figure 1 the dendrogram on the best leaders
(with minimal leaders criterion function) for the variable set WP with ages according to
working population $\{V_1, V_2, V_{3a}\}$ is presented. For the other two clusterings, i.e.
for the variable set AG with 10-year age groups $\{V_1, V_2, V_{3b}\}$ and the variable set
Co with ’country’ included $\{V_1, V_2, V_{3b}, V_4\}$, the dendrograms look similar and are
not displayed. The variable distributions of the final clusters are presented in Figure 2
for the variable set WP, in Figure 3 for the variable set AG, and in Figure 6 for the
variable set Co.

There is one large (C^ p with 24,049 households), one middle sized (C^ p with 11,909 
households) and two small clusters {C^ p with 7,134 and C^ p with 7,280 households) in 
the result for the variable set WP. For the variable set AG, three relatively medium size 
clusters (C^ G with 12,658, Cf G with 14,975, and C^ G with 15,800 households) and a small 
one (p 2 G with 6,939 households) were detected. 

Inspecting the variable distributions, one can see (as expected) that gender is not a
significant separator variable. The other two variables, however, both reveal household
patterns. From Figure 2 and Figure 3 we see that most of the patterns can be matched between
the clustering results with WP (working population age groups) and AG (10-year age groups):
C™ p with C^ G ; P with C^ G ; cluster Cf G is split into two clusters in the WP cluster¬ 
ing Cf p and C’Y P ■ We see that C± G would fit well with C^ P too which is not surprising 
because the working population category is very broad (it spans 30 years, i.e. three to
four 10-year age groups).

The differences in the clustering results are due to the different categorizations of age. The
WP categorization reveals less because it has fewer categories; however, it does separate
two household patterns in which mostly two people (couples) live, $C_1^{WP}$ and $C_3^{WP}$. Some
are still at work, and the others (who sometimes live with some other family member) are
already retired. The AG categorization puts these two groups in the same cluster, $C_1^{AG}$. We
see that those 'still at work' are actually near retirement (they are about 50-60 years old),
so they naturally fall into the same group. The AG categorization, on the other hand, thanks
to its larger number of age categories, shows more difference between the relationship patterns
(a) respondent-parent-sibling and (b) respondent-partner-offspring. These two
relationship patterns reveal core families with (a) a respondent in his/her twenties
and (b) a respondent around 30-50 years old. The two 'types of families' differ in age by
about 10 to 15 years. One further AG cluster, however, shows an additional pattern
that the WP categorization does not reveal: extended families with more females and a
very specific household age pattern.

Considering also the supplementary variable 'country', which was not included in these
two clusterings (Figure 4), we found that the extended-family household pattern has
its largest percentages in Ukraine, the Russian Federation, Bulgaria and Poland. The cluster $C_2^{AG}$, with a
younger questionnaire respondent in the core family, is relatively most frequent in Israel,
Slovenia, Spain and the Czech Republic, and the core-family cluster with an older respondent in the Netherlands,
Greece, Spain and Norway. Respondents living in mostly two-person families were most
frequently interviewed in Finland, Sweden, Denmark and Portugal. Since the ESS is one of the
surveys that should represent the whole population, the very large (and very small) relative
values for a country should indicate the kind of household pattern that can be observed in that
country (i.e. large families in the Russian Federation, Spain and Ukraine).

These differences should be even more pronounced when 'country' is included in the
clustering process. The best clustering split the data set Co (with the additional
variable 'country' included) into one very large cluster ($C_3$ with 27,759 households), one medium-sized
cluster ($C_4$ with 14,488 households) and two small clusters ($C_2$ with 2,120 and $C_1$ with 6,005
households). Figures 5 and 6 show the results. Note that, for easier observation, the scales
for percentages on the horizontal axes differ. We immediately notice the
cluster $C_2$, dominated by extended families in the Russian Federation. This cluster is also
the smallest. The largest cluster, $C_3$ (core family with a small proportion of other members
in the household), is the most evenly distributed among countries, but is most pronounced in
Ukraine and Spain. Shares in the second smallest cluster, $C_1$, with core families and a younger
respondent, are still large in Israel, Slovenia and the Czech Republic, but now also in Poland and
Croatia. We could conjecture that in these countries offspring stay with their parents for a long time
before becoming independent. The cluster $C_4$ consists of older two- to three-person families,
with large proportions of German, Finnish, Swedish and Danish households. This type of
household is the least evenly distributed among countries.

4. Conclusion 

In the paper, versions of the well-known leaders nonhierarchical and Ward's hierarchical
methods, adapted for modal valued symbolic data, are presented. Since data measured
in traditional measurement scales (numerical, ordinal, categorical) can all be transformed
into a modal symbolic representation, the methods can be used for clustering data sets of
mixed units. Our approach also allows the user to take into account, through the weights, the original
frequency information. The proposed clustering methods are compatible: they solve the
same optimization problem and can be used either separately or in combination (usually
for large data sets). The optimization criterion function depends on a basic dissimilarity
$\delta$ that enables the user to specify different criteria. In principle, because of the additivity of the
components of the criterion function, we could use different $\delta$s for different symbolic variables.

Figure 1. Dendrogram for best leaders clustering for variable set WP with
working population age groups.

The presented methods were applied to the example of household structures from the ESS
2010 data set. The clustering was done on nominal (gender, relationships, country) and
interval data (age groups). When clustering such data, information on size (which is important
when design and population weights have to be used to make the sample representative
of a population) was included in the clustering process.


The proposed approach is partially implemented in the R package clamix (Batagelj and Kejzar,
2010).
Figure 2. Variable distributions for final 4 clusters ($C_1^{WP}$ with 7,134, $C_2^{WP}$
with 24,049, $C_3^{WP}$ with 7,280, and $C_4^{WP}$ with 11,909 households) with working
population age groups.

Figure 3. Variable distributions for final 4 clusters for variable set AG
($C_1^{AG}$ with 14,975, $C_2^{AG}$ with 6,939, $C_3^{AG}$ with 15,800, and $C_4^{AG}$ with 12,658
households) with 10-year age categories.

Figure 4. Supplementary variable country for variable set AG.


Amsterdam 

Batagelj, V. and Kejzar, N., Clamix - Clustering Symbolic Objects. Program in R (2010) 
https://r-forge.r-project.org/projects/clamix/ 

Bock, H. H., Diday, E. (editors and coauthors) (2000), Analysis of Symbolic Data. Exploratory
methods for extracting statistical information from complex data. Springer,

Figure 5. Variable country for variable set Co with 10-year age categories. (Four panels, Cluster 1 to Cluster 4; vertical axis: country; horizontal axis: percentage, with a different scale in each panel.)

Figure 6. Variable distributions for final 4 clusters for variable set Co with

Appendix

In this Appendix we present the derivations of the entries $t^*_C$, $z$ and $D(C_u, C_v)$ from Table 1 for
different basic dissimilarities $\delta$.

Since in our approach the clustering criterion function $P(\mathbf{C})$ is additive and the dissimilarity
$d$ is defined component-wise, all derivations can be limited to a single variable
and a single component of it. Therefore, the subscripts $i$ (of a variable) and $j$ (of a component)
will be omitted from the expressions.

In the derivations we follow the same steps as we used for $\delta_1$ in Section 2. To obtain
the component $t^*_C$ of the representative of a cluster $C$, we solve for a selected $\delta$ the one-dimensional
optimization problem (2). Let us denote its criterion function by $F(t)$, $t > 0$:

$$F(t) = \sum_{X \in C} w_x \, \delta(p_x, t);$$

then the optimal solution is obtained as the solution of the equation

$$\frac{dF(t)}{dt} = 0.$$
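
The stationarity condition can also be checked numerically. The sketch below is a minimal illustration under stated assumptions: a cluster is represented as a list of hypothetical `(w_x, p_x)` pairs, the helper `solve_stationary_point` is not part of the paper or of clamix, and $F'$ is assumed to change sign exactly once on the search interval. The example uses $\delta_5(p, t) = (p - t)^2 / p$ and compares the numerical root with the closed form $w_C / H_C$ of Eq. (15).

```python
def solve_stationary_point(units, d_delta_dt, lo, hi, tol=1e-12):
    """Find t in [lo, hi] with F'(t) = sum_x w_x * d(delta)/dt (p_x, t) = 0 by bisection.

    units: list of (w_x, p_x) pairs; d_delta_dt(p, t): derivative of delta in t.
    Assumes F' changes sign exactly once on [lo, hi].
    """
    def f_prime(t):
        return sum(w * d_delta_dt(p, t) for w, p in units)

    f_lo = f_prime(lo)
    if f_lo * f_prime(hi) > 0:
        raise ValueError("F' does not change sign on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f_prime(mid)
        if f_lo * f_mid <= 0:      # root lies in [lo, mid]
            hi = mid
        else:                      # root lies in [mid, hi]
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


# delta_5(p, t) = (p - t)^2 / p has t-derivative -2 (p - t) / p.
units = [(1.0, 0.2), (1.0, 0.4), (2.0, 0.5)]   # illustrative values only
t_numeric = solve_stationary_point(units, lambda p, t: -2.0 * (p - t) / p, 1e-6, 1.0)

w_C = sum(w for w, _ in units)
H_C = sum(w / p for w, p in units)
t_closed = w_C / H_C               # closed form of Eq. (15)
```

Bisection is used here only because it needs nothing beyond the sign of $F'$; the closed forms derived in the following subsections make such a search unnecessary in practice.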

To obtain the component $z$ of the leader of a cluster $C_z$, we use the relation for $t^*_C$ for the
cluster $C_z$ and re-express it in terms of the quantities for $C_u$ and $C_v$.

Finally, to determine the between-cluster dissimilarity $D(C_u, C_v)$ for a selected $\delta$, we
use the relation (6) and, following the scheme for $\delta_1$, the auxiliary quantity $S_u$ from (8)
(after omitting the indices $i$ and $j$):

$$S_u = \sum_{X \in C_u} w_x \bigl(\delta(p_x, z) - \delta(p_x, u)\bigr). \tag{11}$$

For different combinations of the weights used in the expressions, the abbreviations
$w_C$, $P_C$, $Q_C$, $H_C$ and $G_C$ from Table 1 are used. Note that for $C_z = C_u \cup C_v$ and $C_u \cap C_v = \emptyset$
we have $R_z = R_u + R_v$ for $R \in \{w, P, Q, H, G\}$.
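
The additivity of the weight sums is what allows merged clusters to be handled without revisiting their units. The following sketch illustrates it; the class `WeightSums` is hypothetical, and the definitions of the five sums are read off from Eqs. (12) to (16) rather than copied from the table, so they should be treated as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightSums:
    """Weight sums of one cluster for a single variable component."""
    w: float  # sum of w_x
    P: float  # sum of w_x * p_x
    Q: float  # sum of w_x * p_x**2
    H: float  # sum of w_x / p_x      (needs p_x != 0)
    G: float  # sum of w_x / p_x**2   (needs p_x != 0)

    @classmethod
    def from_units(cls, units):
        """units: iterable of (w_x, p_x) pairs with p_x != 0."""
        units = list(units)
        return cls(
            w=sum(w for w, _ in units),
            P=sum(w * p for w, p in units),
            Q=sum(w * p * p for w, p in units),
            H=sum(w / p for w, p in units),
            G=sum(w / (p * p) for w, p in units),
        )

    def __add__(self, other):
        """Sums of the union of two disjoint clusters: R_z = R_u + R_v."""
        return WeightSums(self.w + other.w, self.P + other.P, self.Q + other.Q,
                          self.H + other.H, self.G + other.G)


cluster_u = [(1.0, 0.2), (1.0, 0.4)]   # illustrative values only
cluster_v = [(2.0, 0.5)]
merged = WeightSums.from_units(cluster_u) + WeightSums.from_units(cluster_v)
direct = WeightSums.from_units(cluster_u + cluster_v)
```

Here `merged` is built from the two summaries alone, while `direct` pools the units; the two agree up to rounding.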

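4.1. $\delta_2(x,t) = \left(\frac{p_x - t}{t}\right)^2$. The derivation of the leader $t^*_C$: In this case

$$F(t) = \sum_{X \in C} w_x \left(\frac{p_x - t}{t}\right)^2 \quad\text{and}\quad F'(t) = -2 \sum_{X \in C} w_x \, p_x \, \frac{p_x - t}{t^3} = 0.$$

The leader's component is determined with

$$t^*_C = \frac{\sum_{X \in C} w_x p_x^2}{\sum_{X \in C} w_x p_x} = \frac{Q_C}{P_C}. \tag{12}$$

The derivation of the leader $z$ of the merged disjoint clusters $C_u$ and $C_v$: From
Eq. (12) and $C_u \cap C_v = \emptyset$ follows

$$z = \frac{Q_z}{P_z} = \frac{Q_u + Q_v}{P_u + P_v} = \frac{P_u u + P_v v}{P_u + P_v}.$$

The derivation of the dissimilarity $D(C_u, C_v)$ between the disjoint clusters $C_u$
and $C_v$: Since $u$ is the leader of the cluster $C_u$, it holds $Q_u = P_u u$ (see Eq. (12)). We can
replace $\sum_{X \in C_u} w_x p_x^2 = Q_u$ in the expression $S_u$ (Eq. (11))

$$S_u = \sum_{X \in C_u} w_x \left[ \left(\frac{p_x - z}{z}\right)^2 - \left(\frac{p_x - u}{u}\right)^2 \right]$$

with $P_u u$ and get

$$S_u = P_u \frac{(u - z)^2}{u z^2} = \frac{P_u}{u} \left(\frac{u - z}{z}\right)^2.$$

Similarly, $S_v = \frac{P_v}{v} \left(\frac{v - z}{z}\right)^2$. Combining both expressions (Eq. (6)) we get

$$D(C_u, C_v) = \frac{P_u}{u} \left(\frac{u - z}{z}\right)^2 + \frac{P_v}{v} \left(\frac{v - z}{z}\right)^2.$$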

4.2. $\delta_3(x,t) = \frac{(p_x - t)^2}{t}$. The derivation of the leader $t^*_C$: In this case

$$F(t) = \sum_{X \in C} w_x \frac{(p_x - t)^2}{t} \quad\text{and}\quad F'(t) = \sum_{X \in C} w_x \frac{t^2 - p_x^2}{t^2} = 0.$$

The square of the leader's component is determined with

$$t^{*2}_C = \frac{\sum_{X \in C} w_x p_x^2}{\sum_{X \in C} w_x} = \frac{Q_C}{w_C} \tag{13}$$

and from it

$$t^*_C = \sqrt{\frac{Q_C}{w_C}}.$$

The derivation of the leader $z$ of the merged disjoint clusters $C_u$ and $C_v$: From
Eq. (13) and $C_u \cap C_v = \emptyset$ follows

$$z^2 = \frac{Q_z}{w_z} = \frac{Q_u + Q_v}{w_u + w_v} = \frac{w_u u^2 + w_v v^2}{w_u + w_v} \quad\text{and}\quad z = \sqrt{\frac{w_u u^2 + w_v v^2}{w_u + w_v}}.$$

The derivation of the dissimilarity $D(C_u, C_v)$ between the disjoint clusters $C_u$
and $C_v$: Since $u$ is the leader of the cluster $C_u$ and $Q_u = w_u u^2$ (see Eq. (13)), we can
replace $\sum_{X \in C_u} w_x p_x^2 = Q_u$ in the expression $S_u$

$$S_u = \sum_{X \in C_u} w_x \left[ \frac{(p_x - z)^2}{z} - \frac{(p_x - u)^2}{u} \right]$$

with $w_u u^2$ and get

$$S_u = w_u \frac{(u - z)^2}{z}.$$

Similarly, $S_v = w_v \frac{(v - z)^2}{z}$. Combining both expressions we get

$$D(C_u, C_v) = w_u \frac{(u - z)^2}{z} + w_v \frac{(v - z)^2}{z}.$$

4.3. $\delta_4(x,t) = \left(\frac{p_x - t}{p_x}\right)^2$. The derivation of the leader $t^*_C$: In this case

$$F(t) = \sum_{X \in C} w_x \left(\frac{p_x - t}{p_x}\right)^2 \quad\text{and}\quad F'(t) = -2 \sum_{X \in C} w_x (p_x - t) \frac{1}{p_x^2} = 0.$$

For $p_x \neq 0$ (this is also the condition for $\delta_4(x,t)$ to be defined), the leader's component is
determined with

$$t^*_C = \frac{\sum_{X \in C} \frac{w_x}{p_x}}{\sum_{X \in C} \frac{w_x}{p_x^2}} = \frac{H_C}{G_C}. \tag{14}$$

In the case $p_x = 0$ we set $t^*_C = 0$ and $\delta_4(x,t) = 0$.

The derivation of the leader $z$ of the merged disjoint clusters $C_u$ and $C_v$: From
Eq. (14) and $C_u \cap C_v = \emptyset$ follows

$$z = \frac{H_z}{G_z} = \frac{H_u + H_v}{G_u + G_v} = \frac{H_u + H_v}{\frac{H_u}{u} + \frac{H_v}{v}}.$$

The derivation of the dissimilarity $D(C_u, C_v)$ between the disjoint clusters $C_u$
and $C_v$: Since $u$ is the leader of the cluster $C_u$ and $H_u = G_u u$ (see Eq. (14)), we can
replace $\sum_{X \in C_u} \frac{w_x}{p_x} = H_u$ in the expression $S_u$

$$S_u = \sum_{X \in C_u} w_x \left[ \left(\frac{p_x - z}{p_x}\right)^2 - \left(\frac{p_x - u}{p_x}\right)^2 \right]$$

with $G_u u$ and get

$$S_u = G_u (u - z)^2.$$

Similarly, $S_v = G_v (v - z)^2$. Combining both expressions we get

$$D(C_u, C_v) = G_u (u - z)^2 + G_v (v - z)^2.$$
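
For $\delta_4$ the merged leader is a weighted mean of $u$ and $v$ with weights $G_u$ and $G_v$ (since $H = G \cdot$ leader), so the result can be rewritten in the familiar Ward form $\frac{G_u G_v}{G_u + G_v}(u - v)^2$. This rewriting is our own short algebraic consequence of the two formulas above, not a formula quoted from the text; the sketch below checks it numerically with illustrative values and a hypothetical function name.

```python
def merge_delta4(G_u, u, G_v, v):
    """Return (z, D as derived above, D in Ward form) for delta_4."""
    z = (G_u * u + G_v * v) / (G_u + G_v)          # equals H_z / G_z because H = G * leader
    d_text = G_u * (u - z) ** 2 + G_v * (v - z) ** 2
    d_ward = G_u * G_v / (G_u + G_v) * (u - v) ** 2
    return z, d_text, d_ward


# Sums for the units (w, p) = (1, 0.2), (1, 0.4) and (2, 0.5), illustrative only.
G_u, H_u = 31.25, 7.5      # G = sum w / p^2, H = sum w / p
G_v, H_v = 8.0, 4.0
u, v = H_u / G_u, H_v / G_v                        # leaders, Eq. (14)
z, d_text, d_ward = merge_delta4(G_u, u, G_v, v)
```

Both expressions need only the summaries $(G, \text{leader})$ of the two clusters.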


4.4. $\delta_5(x,t) = \frac{(p_x - t)^2}{p_x}$. The derivation of the leader $t^*_C$: In this case

$$F(t) = \sum_{X \in C} w_x \frac{(p_x - t)^2}{p_x} \quad\text{and}\quad F'(t) = -2 \sum_{X \in C} w_x (p_x - t) \frac{1}{p_x} = 0.$$

For $p_x \neq 0$ (this is also the condition for $\delta_5(x,t)$ to be defined), the leader's component is
determined with

$$t^*_C = \frac{\sum_{X \in C} w_x}{\sum_{X \in C} \frac{w_x}{p_x}} = \frac{w_C}{H_C}. \tag{15}$$

In the case $p_x = 0$ we set $t^*_C = 0$ and $\delta_5(x,t) = 0$.

The derivation of the leader $z$ of the merged disjoint clusters $C_u$ and $C_v$: From
Eq. (15) and $C_u \cap C_v = \emptyset$ follows

$$z = \frac{w_z}{H_z} = \frac{w_u + w_v}{H_u + H_v}.$$

The derivation of the dissimilarity $D(C_u, C_v)$ between the disjoint clusters $C_u$
and $C_v$: Since $u$ is the leader of the cluster $C_u$ and $w_u = H_u u$ (see Eq. (15)), we can
replace $\sum_{X \in C_u} w_x = w_u$ in the expression $S_u$

$$S_u = \sum_{X \in C_u} w_x \left[ \frac{(p_x - z)^2}{p_x} - \frac{(p_x - u)^2}{p_x} \right]$$

with $H_u u$ and get

$$S_u = H_u (u - z)^2 = w_u \frac{(u - z)^2}{u}.$$

Similarly, $S_v = w_v \frac{(v - z)^2}{v}$. Combining both expressions we get

$$D(C_u, C_v) = w_u \frac{(u - z)^2}{u} + w_v \frac{(v - z)^2}{v}.$$
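
The result for $\delta_5$ can be tested on many randomly generated cluster pairs rather than on a single example. The sketch below is such a randomized check under stated assumptions: clusters are lists of `(w_x, p_x)` pairs with $p_x > 0$, the ranges of the random values are arbitrary, and all names are hypothetical.

```python
import random


def summary(units):
    """Return (w_C, H_C, t*_C) for delta_5, with t*_C = w_C / H_C (Eq. (15))."""
    w = sum(w_x for w_x, _ in units)
    H = sum(w_x / p for w_x, p in units)
    return w, H, w / H


def criterion(units, t):
    """F(t) = sum of w_x (p_x - t)^2 / p_x."""
    return sum(w_x * (p - t) ** 2 / p for w_x, p in units)


def closed_form_D(w_u, u, w_v, v, z):
    return w_u * (u - z) ** 2 / u + w_v * (v - z) ** 2 / v


def random_cluster(rng):
    size = rng.randint(1, 6)
    return [(rng.uniform(0.5, 3.0), rng.uniform(0.05, 1.0)) for _ in range(size)]


rng = random.Random(0)             # fixed seed so the check is reproducible
max_gap = 0.0
for _ in range(200):
    cu, cv = random_cluster(rng), random_cluster(rng)
    w_u, H_u, u = summary(cu)
    w_v, H_v, v = summary(cv)
    z = (w_u + w_v) / (H_u + H_v)  # leader of the merged cluster
    direct = criterion(cu + cv, z) - criterion(cu, u) - criterion(cv, v)
    max_gap = max(max_gap, abs(direct - closed_form_D(w_u, u, w_v, v, z)))
```

After the loop, `max_gap` holds the largest absolute difference between the direct evaluation and the closed form; it should be at rounding-error level.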
4.5. $\delta_6(x,t) = \frac{(p_x - t)^2}{p_x t}$. The derivation of the leader $t^*_C$: In this case

$$F(t) = \sum_{X \in C} w_x \frac{(p_x - t)^2}{p_x t} \quad\text{and}\quad F'(t) = \sum_{X \in C} \frac{w_x}{p_x t^2} \left(t^2 - p_x^2\right) = 0.$$

For $p_x \neq 0$ (this is also the condition for $\delta_6(x,t)$ to be defined), the square of the leader's component is
determined with

$$t^{*2}_C = \frac{\sum_{X \in C} w_x p_x}{\sum_{X \in C} \frac{w_x}{p_x}} = \frac{P_C}{H_C} \tag{16}$$

and from it

$$t^*_C = \sqrt{\frac{P_C}{H_C}}.$$

In the case $p_x = 0$ we set $t^*_C = 0$ and $\delta_6(x,t) = 0$.

The derivation of the leader $z$ of the merged disjoint clusters $C_u$ and $C_v$: From
Eq. (16) and $C_u \cap C_v = \emptyset$ follows

$$z^2 = \frac{P_z}{H_z} = \frac{P_u + P_v}{H_u + H_v} = \frac{P_u + P_v}{\frac{P_u}{u^2} + \frac{P_v}{v^2}} \quad\text{and}\quad z = \sqrt{\frac{P_u + P_v}{\frac{P_u}{u^2} + \frac{P_v}{v^2}}}.$$

The derivation of the dissimilarity $D(C_u, C_v)$ between the disjoint clusters $C_u$
and $C_v$: Since $u$ is the leader of the cluster $C_u$ and $P_u = H_u u^2$ (see Eq. (16)), we can
replace $\sum_{X \in C_u} w_x p_x = P_u$ in the expression $S_u$

$$S_u = \sum_{X \in C_u} w_x \left[ \frac{(p_x - z)^2}{p_x z} - \frac{(p_x - u)^2}{p_x u} \right]$$

with $H_u u^2$ and get

$$S_u = P_u \frac{(u - z)^2}{u^2 z} = \frac{P_u}{u} \frac{(u - z)^2}{u z}.$$

Similarly, $S_v = \frac{P_v}{v} \frac{(v - z)^2}{v z}$. Combining both expressions we get

$$D(C_u, C_v) = \frac{P_u}{u} \frac{(u - z)^2}{u z} + \frac{P_v}{v} \frac{(v - z)^2}{v z}.$$
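
In an agglomerative step the pair of clusters with the smallest dissimilarity is merged. The sketch below shows one such step for $\delta_6$ on a single variable component, using only the summaries $(P, \text{leader})$ of each cluster; in the full method the dissimilarity would be summed over all variables and components. The cluster names, the values and the helper functions are hypothetical.

```python
import math
from itertools import combinations


def merge_delta6(P_u, u, P_v, v):
    """Return (P_z, z, D(C_u, C_v)) for delta_6 from the summaries (P, leader)."""
    z = math.sqrt((P_u + P_v) / (P_u / u ** 2 + P_v / v ** 2))
    d = P_u * (u - z) ** 2 / (u * u * z) + P_v * (v - z) ** 2 / (v * v * z)
    return P_u + P_v, z, d


def closest_pair(clusters):
    """clusters: dict name -> (P, leader). Return ((a, b), D) with minimal D."""
    best = None
    for a, b in combinations(sorted(clusters), 2):
        _, _, d = merge_delta6(*clusters[a], *clusters[b])
        if best is None or d < best[1]:
            best = ((a, b), d)
    return best


clusters = {"a": (0.6, 0.28), "b": (1.0, 0.50), "c": (0.8, 0.30)}  # illustrative only
pair, d = closest_pair(clusters)

# Carry out the merge and replace the two clusters by their union.
P_z, z, _ = merge_delta6(*clusters[pair[0]], *clusters[pair[1]])
merged_clusters = {k: val for k, val in clusters.items() if k not in pair}
merged_clusters["+".join(pair)] = (P_z, z)
```

Repeating this step until one cluster remains yields the merge heights from which a dendrogram can be drawn.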



















